2,454 research outputs found

    Improved Algorithms for Time Decay Streams

    In the time-decay model for data streams, elements of an underlying data set arrive sequentially, with recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for k-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Delta) + h) points, where h is the half-life of the decay function and Delta is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well.
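
    To make the role of the half-life h concrete, the sketch below computes a decayed k-median cost on a one-dimensional stream. It is only an illustration of the objective, not the paper's algorithm: the weight form 0.5 ** (age / h), the (timestamp, value) stream layout, and the names decay_weight and decayed_kmedian_cost are assumptions introduced here.

```python
def decay_weight(age, half_life):
    """Exponential decay: the weight halves every `half_life` time units.

    Assumption: age is measured in the same units as half_life.
    """
    return 0.5 ** (age / half_life)


def decayed_kmedian_cost(stream, centers, now, half_life):
    """Sum of decayed distances from each point to its nearest center.

    stream: iterable of (arrival_time, value) pairs (hypothetical layout).
    centers: candidate 1-D centers.
    """
    total = 0.0
    for arrival_time, value in stream:
        nearest = min(abs(value - c) for c in centers)
        total += decay_weight(now - arrival_time, half_life) * nearest
    return total


if __name__ == "__main__":
    stream = [(0, 0.0), (10, 4.0)]
    # The older point (age 10, half-life 10) counts half as much as the new one.
    print(decayed_kmedian_cost(stream, centers=[1.0], now=10, half_life=10))
```

    Under this weighting, a point that arrived h time units ago contributes half as much to the cost as a point arriving now, which is why older points can be summarized more coarsely.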

    Pattern discovery and model construction: an evolutionary learning and data mining approach

    In the information age, knowledge leads to profits, power and success. As an ancestor of data mining, machine learning has concerned itself with the discovery of new knowledge on its own. This paper presents experimental results produced by genetic algorithms in the domains of model construction and event prediction, the areas on which data mining systems have been focusing. The experimental results show that genetic algorithms are able to discover useful patterns and regularities in large sets of data, and to construct models that conceptualize input data. This demonstrates that genetic algorithms are a powerful and useful learning technique for solving the fundamental tasks that data mining systems face today. Applications in Artificial Intelligence - Knowledge Discovery. Red de Universidades con Carreras en Informática (RedUNCI).

    The H-index of a network node and its relation to degree and coreness

    Identifying influential nodes in dynamical processes is crucial in understanding network structure and function. Degree, H-index and coreness are widely used metrics, but were previously treated as unrelated. Here we show their relation by constructing an operator H, in terms of which degree, H-index and coreness are the initial, intermediate and steady states of the sequences, respectively. We obtain a family of H-indices that can be used to measure a node’s importance. We also prove that the convergence to coreness can be guaranteed even under an asynchronous updating process, allowing a decentralized local method of calculating a node’s coreness in large-scale evolving networks. Numerical analyses of the susceptible-infected-removed spreading dynamics on disparate real networks suggest that the H-index is a good tradeoff that in many cases can better quantify node influence than either degree or coreness. This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 11205042, 11222543, 11075031, 61433014). L.L. acknowledges the research start-up fund of Hangzhou Normal University under Grant No. PE13002004039 and the EU FP7 Grant 611272 (project GROWTHCOM). The Boston University work was supported by NSF Grants CMMI 1125290, CHE 1213217 and PHY 1505000. Published version.
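
    The relation described in this abstract can be sketched in a few lines, assuming the usual H-index definition (the largest h such that at least h values are at least h) applied to the current values of a node's neighbors. The function names h_index and h_sequence, the adjacency-dict layout, and the synchronous update order are illustrative choices made here, not taken from the paper.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    h = 0
    for rank, v in enumerate(sorted(values, reverse=True), start=1):
        if v >= rank:
            h = rank
        else:
            break
    return h


def h_sequence(adj, max_iters=1000):
    """Iterate the neighbor H-index, starting from node degree.

    adj: dict mapping node -> list of neighbors (undirected, hypothetical layout).
    Returns the list of states; the first is degree, the last is the fixed point.
    """
    current = {node: len(nbrs) for node, nbrs in adj.items()}
    history = [current]
    for _ in range(max_iters):
        nxt = {node: h_index([current[n] for n in nbrs])
               for node, nbrs in adj.items()}
        if nxt == current:  # steady state reached
            break
        history.append(nxt)
        current = nxt
    return history


if __name__ == "__main__":
    # Triangle a-b-c with a two-node tail c-d-e.
    adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
           "d": ["c", "e"], "e": ["d"]}
    for step, state in enumerate(h_sequence(adj)):
        print(step, state)
```

    On this small graph the steady state assigns 2 to the triangle nodes and 1 to the tail nodes, which matches their coreness (k-core index).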

    New Frameworks for Offline and Streaming Coreset Constructions

    A coreset for a set of points is a small subset of weighted points that approximately preserves important properties of the original set. Specifically, if $P$ is a set of points, $Q$ is a set of queries, and $f:P\times Q\to\mathbb{R}$ is a cost function, then a set $S\subseteq P$ with weights $w:P\to[0,\infty)$ is an $\epsilon$-coreset for some parameter $\epsilon>0$ if $\sum_{s\in S}w(s)f(s,q)$ is a $(1+\epsilon)$ multiplicative approximation to $\sum_{p\in P}f(p,q)$ for all $q\in Q$. Coresets are used to solve fundamental problems in machine learning under various big data models of computation. Many of the suggested coresets in the recent decade used, or could have used a general framework for constructing coresets whose size depends quadratically on what is known as total sensitivity $t$. In this paper we improve this bound from $O(t^2)$ to $O(t\log t)$. Thus our results imply more space efficient solutions to a number of problems, including projective clustering, $k$-line clustering, and subspace approximation. Moreover, we generalize the notion of sensitivity sampling for sup-sampling that supports non-multiplicative approximations, negative cost functions and more. The main technical result is a generic reduction to the sample complexity of learning a class of functions with bounded VC dimension. We show that obtaining an $(\nu,\alpha)$-sample for this class of functions with appropriate parameters $\nu$ and $\alpha$ suffices to achieve space efficient $\epsilon$-coresets. Our result implies more efficient coreset constructions for a number of interesting problems in machine learning; we show applications to $k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the integer $(j,k)$-projective clustering problem.
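
    The coreset definition above can be checked directly on small inputs. The sketch below is a brute-force verifier, not a construction: it assumes a non-negative cost function, a finite list of queries, and reads "(1+epsilon) multiplicative approximation" as the two-sided bound (1 - eps) * full <= approx <= (1 + eps) * full. The name is_eps_coreset and the argument layout are hypothetical.

```python
def is_eps_coreset(P, S, w, Q, f, eps):
    """Check the coreset inequality for every query in a finite list Q.

    P: full point set; S: candidate subset; w: weights aligned with S.
    f: cost function f(point, query), assumed non-negative here.
    """
    for q in Q:
        full = sum(f(p, q) for p in P)
        approx = sum(wi * f(s, q) for s, wi in zip(S, w))
        if not (1 - eps) * full <= approx <= (1 + eps) * full:
            return False
    return True


if __name__ == "__main__":
    dist = lambda p, q: abs(p - q)  # 1-D distance as the cost
    P = [0, 0, 1, 1]
    queries = [0, 0.5, 3]
    # Keeping one copy of each distinct point with weight 2 preserves every cost.
    print(is_eps_coreset(P, [0, 1], [2, 2], queries, dist, eps=0.1))
    # Keeping only the point 0 with weight 4 misprices the query q = 0.
    print(is_eps_coreset(P, [0], [4], queries, dist, eps=0.1))
```

    The first candidate passes for any eps because the weighted sum equals the full cost exactly; the second fails because it reports cost 0 at q = 0 where the true cost is 2.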

    Systemic risk and spatiotemporal dynamics of the US housing market

    Housing markets play a crucial role in economies, and the collapse of a real-estate bubble usually destabilizes the financial system and causes economic recessions. We investigate the systemic risk and spatiotemporal dynamics of the US housing market (1975–2011) at the state level based on Random Matrix Theory (RMT). We identify richer economic information in the largest eigenvalues deviating from RMT predictions for the housing market than for stock markets, and find that the component signs of the eigenvectors contain either geographical information or the extent of differences in house price growth rates, or both. By looking at the evolution of different quantities such as eigenvalues and eigenvectors, we find that the US housing market experienced six different regimes, which is consistent with the evolution of state clusters identified by the box clustering algorithm and the consensus clustering algorithm on the partial correlation matrices. We find that dramatic increases in the systemic risk are usually accompanied by regime shifts, which provide a means of early detection of housing bubbles. HM, WJX, ZQJ and WXZ received support from the National Natural Science Foundation of China Grants 11075054 and 71131007, the Shanghai (Follow-up) Rising Star Program Grant 11QH1400800, the Shanghai "Chen Guang" Project Grant 2012CG34, and Fundamental Research Funds for the Central Universities. BP and HES received support from the Defense Threat Reduction Agency (DTRA), the Office of Naval Research (ONR), and the National Science Foundation (NSF) Grant CMMI 1125290. Published version.

    Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows

    We study the distinct elements and l_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and l_p-heavy hitters that are nearly optimal in both n and epsilon. Applying our new composable histogram framework, we provide an algorithm that outputs a (1+epsilon)-approximation to the number of distinct elements in the sliding window model and uses O((1/epsilon^2) log n log(1/epsilon) log log n + (1/epsilon) log^2 n) bits of space. For l_p-heavy hitters, we provide an algorithm using space O((1/epsilon^p) log^2 n (log^2 log n + log(1/epsilon))) for 0 < p <= 2, improving upon the best-known algorithm for l_2-heavy hitters (Braverman et al., COCOON 2014), which has space complexity O((1/epsilon^4) log^3 n). We also show complementing nearly optimal lower bounds of Omega((1/epsilon) log^2 n + (1/epsilon^2) log n) for distinct elements and Omega((1/epsilon^p) log^2 n) for l_p-heavy hitters, both tight up to O(log log n) and O(log(1/epsilon)) factors.

    Genetic architecture behind developmental and seasonal control of tree growth and wood properties in Norway spruce

    Genetic control of tree growth and wood formation varies depending on the age of the tree and the time of the year. Single-locus, multi-locus, and multi-trait genome-wide association studies (GWAS) were conducted on 34 growth and wood property traits in 1,303 Norway spruce individuals using exome capture to cover approximately 130K single-nucleotide polymorphisms (SNPs). GWAS identified associations to the different wood traits in a total of 85 gene models, and several of these were validated in a progenitor population. A multi-locus GWAS model identified more SNPs associated with the studied traits than single-locus or multivariate models. Changes in tree age and annual season influenced the genetic architecture of growth and wood properties in unique ways, manifested by non-overlapping SNP loci. In addition to completely novel candidate genes, SNPs were located in genes previously associated with wood formation, such as cellulose synthases and a NAC transcription factor, but that have not previously been linked to seasonal or age-dependent regulation of wood properties. Interestingly, SNPs associated with the width of the year rings were identified in homologs of Arabidopsis thaliana BARELY ANY MERISTEM 1 and rice BIG GRAIN 1, which have been previously shown to control cell division and biomass production. The results provide tools for future Norway spruce breeding and functional studies.

    Enhanced fatty acid production in engineered chemolithoautotrophic bacteria using reduced sulfur compounds as energy sources

    Chemolithoautotrophic bacteria that oxidize reduced sulfur compounds, such as H2S, while fixing CO2 are an untapped source of renewable bioproducts from sulfide-laden waste, such as municipal wastewater. In this study, we report engineering of the chemolithoautotrophic bacterium Thiobacillus denitrificans to produce up to 52-fold more fatty acids than the wild-type strain when grown with thiosulfate and CO2. A modified thioesterase gene from E. coli (‘tesA) was integrated into the T. denitrificans chromosome under the control of Pkan or one of two native T. denitrificans promoters. The relative strength of the two native promoters, as assessed by fatty acid production in engineered strains, was very similar to that assessed by expression of the cognate genes in the wild-type strain. This proof-of-principle study suggests that engineering sulfide-oxidizing chemolithoautotrophic bacteria to overproduce fatty acid-derived products merits consideration as a technology that could simultaneously produce renewable fuels/chemicals and cost-effectively remediate sulfide-contaminated wastewater. Keywords: Chemolithoautotrophic, Sulfide, Fatty acids, tesA, Thiobacillus denitrificans